Add some additional information to customize the knitted document:
date: "October 01, 2020"
output:
  html_document:
    number_sections: yes
    theme: cerulean
    toc: yes
    toc_depth: 5
    toc_float: yes
  pdf_document:
    toc: yes
    toc_depth: '5'
This adds a table of contents (toc) and changes the color scheme (theme: cerulean).
To find your favorite Rmarkdown theme: https://www.datadreaming.org/post/r-markdown-theme-gallery/
knitr::opts_chunk$set(cache=TRUE, fig.path='figures/', fig.width=8, fig.height=5 )
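This saves all figures in the figures/ directory, sets the default figure size (8 x 5 inches), and caches chunk results (cache=TRUE) so unchanged chunks are not re-run on every knit.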
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
Rmarkdown Cheatsheet: https://rmarkdown.rstudio.com/lesson-15.html
“#” hash signs indicate headers.
The number of hashes equals the header level.
Placing a single asterisk on either side of a phrase makes it italic.
Double asterisks make a word or phrase bold.
Triple asterisks make a word or phrase bold and italic.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
Execute this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.
You can also embed plots, for example:
(Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.)
echo=FALSE displays only the output, not the code.
Some more chunk options:
* Use echo=FALSE to avoid having the code itself shown.
* Use results="hide" to avoid having any results printed.
* Use eval=FALSE to have the code shown but not evaluated.
* Use warning=FALSE and message=FALSE to hide any warnings or messages produced.
* Use fig.height and fig.width to control the size of the figures produced (in inches).
Naming chunks is good practice (the chunk above was named pressure):
* names help you navigate around the document
* figure files are named after the chunk that produced them (check the Rproject directory after knitting)
You can also include images from your local computer or from the web:
You can type out tables by hand:
| col name | | |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 2 | 2 |
Alternatively, you can use the knitr package to make markdown tables from data frames:
| speed | dist |
|---|---|
| 4 | 2 |
| 4 | 10 |
| 7 | 4 |
| 7 | 22 |
| 8 | 16 |
| 9 | 10 |
Columns can be left, right, or center aligned (the align argument of kable() takes "l", "r", or "c" for each column).
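As a minimal sketch of the table-from-data-frame approach, the snippet below builds a markdown table from the first rows of the built-in cars data and sets the column alignment. The choice of left and right alignment is only for illustration, and format = "markdown" is passed explicitly so the result does not depend on the knitting context.

```r
library(knitr)

# First six rows of the built-in cars data as a markdown table;
# align takes one letter per column: "l" (left), "r" (right), "c" (center)
tbl <- kable(head(cars), format = "markdown", align = c("l", "r"))

tbl
```

Inside an R Markdown chunk the table is rendered when the document is knitted. In the console it prints as plain markdown text.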
When you knit the file, an HTML file containing the code and output will be saved alongside it (click the Knit button or press Cmd+Shift+K to preview the HTML file).
The preview shows you a rendered HTML copy of the contents of the editor (Viewer tab).
Rproject Benefits:
No need to set the working directory. All paths are relative to the directory containing the Rproject.
Whenever you open your project, the working directory is automatically set to where your project is. This means your code will not break when you work on a different computer.
RStudio projects allow you to open multiple projects at the same time with each open to its own project directory. This allows you to keep multiple projects open without them interfering with each other.
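To illustrate relative paths, here is a small sketch. It assumes the project contains a data directory holding gapminder_data.csv, as set up later in this workshop.

```r
# Build a path relative to the project directory (no setwd() needed)
data_path <- file.path("data", "gapminder_data.csv")
data_path

# The absolute location this resolves to depends on where the project lives,
# which is exactly why the relative form is the portable one
normalizePath(data_path, mustWork = FALSE)
```

Because only the relative path appears in the code, the same script should work unchanged on another computer as long as the project layout is the same.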
Good organization / project layout will:
Project Management tips:
* results directory
* src directory
* descriptive file names (e.g. fig1_pca_communitycomposition.jpg not Rplot1.jpg)
* symbolic links (ln -s)
Data for this workshop
Following good project management practices, make a new directory called data and download the data we will be playing with in this workshop into that directory:
In terminal tab:
mkdir data
cd data
wget https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv
# (curl can be used instead if wget is not available)
We will use the data later, but we can get a general sense of the data by looking at it in the terminal, which will help us decide how to load it into R later:
wc -l gapminder_data.csv
head gapminder_data.csv
cd -
Go to your GitHub account and make a new repository. DO NOT initialize it with a README.
Follow the instructions on the next page
(in the terminal tab):
echo "# SkillPill_ReproducibleR" >> README.md
git init
git add README.md
git commit -m "first commit"
git remote add origin https://github.com/maggimars/SkillPill_ReproducibleR.git
git push -u origin master
README.md is a markdown file and uses much the same syntax as this Rmarkdown file.
Also try adding your data directory to your GitHub repository!
Alternatively, you can use the RStudio interface for version control with Git: https://swcarpentry.github.io/git-novice/14-supplemental-rstudio/
(I prefer the command line)
?function_name
If you can't quite remember a function name, search the help pages with ??function_name
pro-tip: From within the function help page, you can highlight code in the Examples and hit Ctrl+Return to run it in the RStudio console. This gives you a quick way to get a feel for how a function works.
?kable
For special operators use quotes, e.g. ?"<-"
Without any arguments, vignette() will list all vignettes for all installed packages; vignette(package="package-name") will list all available vignettes for package-name, and vignette("vignette-name") will open the specified vignette.
And then there is always Google.
# Reproducible and Streamlined Analyses (Day 2)
## Exploring the sample data
We already looked at the sample data in Terminal and saw that it was a .csv file with 1705 lines and that it does have a header.
gapminder<- read.csv("data/gapminder_data.csv", header = TRUE)
View data in another tab with View()
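Besides View(), a few base R functions give a quick overview of a freshly loaded data frame. The sketch below uses a tiny made-up CSV (the toy object and its values are hypothetical, not the gapminder data) so that it runs without any files. The same calls work on gapminder.

```r
# Hypothetical three-column CSV with a header row, read from a string
csv_text <- "country,year,pop
A,1952,100
B,1952,200"

toy <- read.csv(text = csv_text, header = TRUE)

dim(toy)   # number of rows and columns
str(toy)   # column names and types
head(toy)  # first rows
```

Comparing the row count from dim() with the line count from wc -l (minus the header line) is a simple check that the whole file was read.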
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Lesson Materials: http://swcarpentry.github.io/r-novice-gapminder/14-tidyr/index.html
CheatSheet: https://github.com/rstudio/cheatsheets/blob/master/data-import.pdf
Be aware: gather() and spread() have been superseded by pivot_longer() and pivot_wider().
Extra information:
vignette("pivot")
## starting httpd help server ... done
Long v. Wide
So what is this all about?
Long and wide dataframe layouts mainly affect readability. For humans, the wide format is often more intuitive since we can often see more of the data on the screen due to its shape. However, the long format is more machine readable and is closer to the formatting of databases. The ID variables in our dataframes are similar to the fields in a database and observed variables are like the database values.
Researchers often want to reshape their dataframes from ‘wide’ to ‘longer’ layouts, or vice-versa. The ‘long’ layout or format is where:
* each column is a variable
* each row is an observation
In the purely ‘long’ (or ‘longest’) format, you usually have 1 column for the observed variable and the other columns are ID variables.
For the ‘wide’ format each row is often a site/subject/patient and you have multiple observation variables containing the same type of data. These can be either repeated observations over time, or observation of multiple variables (or a mix of both). You may find data input may be simpler or some other applications may prefer the ‘wide’ format. However, many of R’s functions have been designed assuming you have ‘longer’ formatted data. (Especially ggplot!)
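The difference between the two layouts can be sketched with a tiny made-up data frame. The site and count columns below are hypothetical and only serve to show one ID variable and one repeated observation.

```r
library(tidyr)

# Wide: one row per site, one column per year of the same kind of observation
wide <- data.frame(
  site       = c("A", "B"),
  count_2019 = c(10, 20),
  count_2020 = c(12, 18)
)

# Longer: one row per site-year observation; site stays as the ID variable
long <- pivot_longer(
  wide,
  cols         = starts_with("count"),
  names_to     = "year",
  names_prefix = "count_",
  values_to    = "count"
)

long
```

The two rows of the wide table become four rows in the longer table, one for each site and year combination.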
library(tidyr)
Question:
Is gapminder a purely long, purely wide, or some intermediate format?
Using pivot:
Until now, we’ve been using the nicely formatted original gapminder dataset, but ‘real’ data (i.e. our own research data) will never be so well organized. Here let’s start with the wide formatted version of the gapminder dataset.
Challenge:
Download this dataset into your data directory: https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_wide.csv
Then read the data into R and name the dataframe gap_wide:
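One possible way to read the downloaded file is sketched below. The helper read_gap_wide() is hypothetical (not part of any package), and the default path assumes the file was saved as data/gapminder_wide.csv.

```r
# Hypothetical helper: read the wide gapminder file from a local path,
# failing with a clear message if the download step was skipped
read_gap_wide <- function(path = file.path("data", "gapminder_wide.csv")) {
  if (!file.exists(path)) {
    stop("File not found: ", path, " (download it into data/ first)")
  }
  read.csv(path, header = TRUE, stringsAsFactors = FALSE)
}

# Usage, once the file is in place:
# gap_wide <- read_gap_wide()
```

stringsAsFactors = FALSE keeps text columns such as country as character vectors regardless of the R version.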
or
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
#Wrapped in as.data.frame() to avoid problems associated with the data.table format that fread() returns
gap_wide <- as.data.frame(fread("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_wide.csv", header=TRUE))
That gives us this:
and we want to practice pivoting longer with pivot_longer():
gap_long <- gap_wide %>%
  pivot_longer(
    cols = c(starts_with('pop'), starts_with('lifeExp'), starts_with('gdpPercap')),
    names_to = "obstype_year", values_to = "obs_values"
  )
str(gap_long)
## tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
## $ continent : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
## $ country : chr [1:5112] "Algeria" "Algeria" "Algeria" "Algeria" ...
## $ obstype_year: chr [1:5112] "pop_1952" "pop_1957" "pop_1962" "pop_1967" ...
## $ obs_values : num [1:5112] 9279525 10270856 11000948 12760499 14760787 ...
You can also select the columns to pivot by exclusion with the "-" syntax, e.g. cols = c(-continent, -country).
Using separate():
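A minimal sketch of the exclusion syntax, using a small made-up data frame shaped like gap_wide (the values are invented for illustration):

```r
library(tidyr)

# Hypothetical miniature of the wide layout: two ID columns, two observation columns
mini_wide <- data.frame(
  continent = c("Africa", "Europe"),
  country   = c("X", "Y"),
  pop_1952  = c(100, 200),
  pop_1957  = c(110, 210)
)

# Pivot every column except the ID columns
mini_long <- pivot_longer(
  mini_wide,
  cols      = c(-continent, -country),
  names_to  = "obstype_year",
  values_to = "obs_values"
)

mini_long
```

Excluding the ID columns is often shorter than listing every observation column, especially when there are many of them.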
gap_long <- gap_long %>% separate(obstype_year, into = c('obs_type', 'year'), sep = "_")
gap_long$year <- as.integer(gap_long$year)
Challenge: Using gap_long, calculate the mean life expectancy, population, and gdpPercap for each continent. Hint: use the group_by() and summarize() functions
gap_long %>%
  group_by(continent, obs_type) %>%
  summarize(mean_obs = mean(obs_values))
## # A tibble: 15 x 3
## # Groups: continent [5]
## continent obs_type mean_obs
## <chr> <chr> <dbl>
## 1 Africa gdpPercap 2194.
## 2 Africa lifeExp 48.9
## 3 Africa pop 9916003.
## 4 Americas gdpPercap 7136.
## 5 Americas lifeExp 64.7
## 6 Americas pop 24504795.
## 7 Asia gdpPercap 7902.
## 8 Asia lifeExp 60.1
## 9 Asia pop 77038722.
## 10 Europe gdpPercap 14469.
## 11 Europe lifeExp 71.9
## 12 Europe pop 17169765.
## 13 Oceania gdpPercap 18622.
## 14 Oceania lifeExp 74.3
## 15 Oceania pop 8874672.
Going in the other direction with pivot_wider() … (time dependent)
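As a rough sketch of the reverse operation, the snippet below widens a small made-up long table. The column names mirror gap_long, but the values are invented.

```r
library(tidyr)

# Hypothetical long table: one row per country and observation type
mini_long <- data.frame(
  country    = c("X", "X", "Y", "Y"),
  obs_type   = c("pop", "lifeExp", "pop", "lifeExp"),
  obs_values = c(100, 70, 200, 65)
)

# Spread the observation types back out into their own columns
mini_wide <- pivot_wider(
  mini_long,
  names_from  = obs_type,
  values_from = obs_values
)

mini_wide
```

Each distinct value of obs_type becomes a column, giving one row per country.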
Cheatsheet: https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
Cook Book: http://www.cookbook-r.com/Graphs/
Plotting can be for exploration or for sharing. Often what a plot looks like will depend on why you are making it (i.e. exploration v. sharing).
Explore:
Share:
Question: What would be different in the way you make a plot based on whether it is for exploring or sharing?
Since this minicourse is about reproducible analyses for sharing and publishing - we will work on making sharable plots.
library(ggplot2) # already loaded, but just in case
GG = the grammar of graphics
ggplots are built in layers (same concept as Illustrator)
1st layer:
ggplot()
the “canvas” so to speak
2nd layer:
ggplot(gapminder, aes(x=year, y = gdpPercap))
gapminder %>%
  ggplot(aes(x=year, y=gdpPercap))
The aesthetics (based on the data)
3rd layer:
ggplot(gapminder, aes(x=gdpPercap, y = lifeExp)) +
  geom_point()
Challenge: Modify the example so that the figure shows how life expectancy has changed over time
#Change the x=
ggplot(gapminder, aes(x=year, y = lifeExp)) +
  geom_point()
#make points more transparent
ggplot(gapminder, aes(x=year, y = lifeExp)) +
  geom_point(alpha = 0.5)
#de-aggregate the points
ggplot(gapminder, aes(x=year, y = lifeExp)) +
  geom_jitter(alpha = 0.5)
#de-aggregate the points with less spread
ggplot(gapminder, aes(x=year, y = lifeExp)) +
  geom_jitter(alpha = 0.5, width = 1)
In the previous examples and challenge we’ve used the aes function to tell the scatterplot geom about the x and y locations of each point. Another aesthetic property we can modify is the point color. Modify the code from the previous challenge to color the points by the “continent” column. What trends do you see in the data? Are they what you expected?
ggplot(gapminder, aes(x=year, y = lifeExp, color = continent)) +
  geom_jitter(alpha = 0.5, width = 1)
#Two ways to do it
ggplot(gapminder, aes(x=year, y = lifeExp)) +
  geom_jitter(alpha = 0.5, width = 1, aes(color = continent))
https://github.com/karthik/wesanderson
https://cran.r-project.org/web/packages/jcolors/vignettes/using_the_jcolors_package.html
https://www.r-graph-gallery.com/38-rcolorbrewers-palettes.html
https://github.com/dill/beyonce
ggplot(gapminder, aes(x=year, y = lifeExp)) +
  geom_boxplot(aes(continent))
Other geoms: bar, box, violin, text, line, smooth (lm or loess) … The order of the layers matters: later layers are drawn on top of earlier ones.
geom_hline(yintercept, linetype, color, size)
alpha() jitter() scale_x_log10()
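To show how several of these pieces combine, here is a small sketch using the built-in cars data rather than gapminder, so it runs without the workshop files. The object name p_layers and the styling values are arbitrary choices.

```r
library(ggplot2)

# Layers are drawn in the order they are added:
# points first, then the fitted line, then the reference line on top
p_layers <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x) +
  geom_hline(yintercept = mean(cars$dist), linetype = "dashed", color = "grey40") +
  scale_x_log10()

p_layers
```

Swapping the order of geom_point() and geom_smooth() would draw the points over the fitted line instead.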
ggplot(gapminder, aes(x=year, y = lifeExp)) +
  geom_boxplot(aes(continent))
ggplot(gapminder, aes(x=year, y = lifeExp)) +
  geom_violin(aes(continent))
ggplot(gapminder, aes(x=year, y = lifeExp)) +
  geom_violin(aes(continent)) +
  geom_point(aes(x=year, y = lifeExp))
labs(
  x = "Year",              # x axis title
  y = "Life expectancy",   # y axis title
  title = "Figure 1",      # main title of figure
  color = "Continent"      # title of legend
)
More on Titles, subtitles, and captions: https://www.datanovia.com/en/blog/ggplot-title-subtitle-and-caption/
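labs() is added to a plot like any other layer. The sketch below applies it to the built-in iris data, chosen only because it has a grouping column for the legend. The label text is illustrative.

```r
library(ggplot2)

p_labeled <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  labs(
    x = "Sepal length (cm)",   # x axis title
    y = "Petal length (cm)",   # y axis title
    title = "Figure 1",        # main title of figure
    color = "Species"          # title of legend
  )

p_labeled
```

The color entry names the legend because color is the aesthetic that the legend describes.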
ggplot(gapminder, aes(x=gdpPercap, y = lifeExp)) +
  geom_point() + theme_bw()
ggplot(gapminder, aes(x=gdpPercap, y = lifeExp)) +
  geom_point() + theme_classic()
ggplot(gapminder, aes(x=gdpPercap, y = lifeExp)) +
  geom_point() + theme_test()
ggplot(gapminder, aes(x=gdpPercap, y = lifeExp)) +
  geom_point() + theme_void()
multipanel plots
gapminder %>%
  ggplot(aes(x = year, y = lifeExp, color = continent)) +
  facet_wrap(~continent) +
  geom_point()
#the shared y axis was hiding some of the variation in Oceania, so we let each panel have its own y scale
gapminder %>%
  ggplot(aes(x = year, y = lifeExp, color = continent)) +
  facet_wrap(~continent, scales = "free_y") +
  geom_point()
gapminder %>%
  ggplot(aes(x = year, y = lifeExp, color = continent)) +
  facet_wrap(~continent, scales = "free_y") +
  geom_point() +
  scale_y_continuous(n.breaks = 3)
ggsave
ggsave() exports the last plot you made, unless you pass a different one through its plot argument.
ggsave("figures/faceted_lifeExpByYear.png", height = 5, width = 7)
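Relying on the last plot can be fragile when chunks are re-run out of order, so one option is to store the plot in an object and pass it explicitly. This is a sketch using the built-in cars data and a temporary output file. The names p_cars and out_file are arbitrary.

```r
library(ggplot2)

# Store the plot in an object instead of relying on the last plot drawn
p_cars <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point()

# Write to a temporary location; in a project this would be e.g. figures/...
out_file <- file.path(tempdir(), "cars_scatter.pdf")
ggsave(out_file, plot = p_cars, width = 7, height = 5)

file.exists(out_file)
```

The file extension determines the output format, and width and height are in inches by default.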
#install.packages("patchwork")
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.0.2
https://gotellilab.github.io/GotelliLabMeetingHacks/NickGotelli/ggplotPatchwork.html
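A minimal sketch of combining plots with patchwork, again using the built-in cars data so it runs on its own. The plot objects p1 and p2 are placeholders for any two ggplots.

```r
library(ggplot2)
library(patchwork)

p1 <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point()
p2 <- ggplot(cars, aes(x = speed)) +
  geom_histogram(bins = 10)

# "+" places plots side by side, "/" stacks them vertically
side_by_side <- p1 + p2
stacked      <- p1 / p2

side_by_side
```

The combined object can be passed to ggsave() through its plot argument like any single ggplot.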